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Bayesian inference for exponential random graph models 



Abstract 

Exponential random graph models are extremely difficult models to handle from a 
statistical viewpoint, since their normalising constant, which depends on model parame- 
ters, is available only in very trivial cases. We show how inference can be carried out in a 
Bayesian framework using a MCMC algorithm, which circumvents the need to calculate 
the normalising constants. We use a population MCMC approach which accelerates con- 
vergence and improves mixing of the Markov chain. This approach improves performance 
with respect to the Monte Carlo maximum likelihood method of |Geyer and Thompson 
(p92|. 



1 Introduction 

Our motivation for this article is to propose Bayesian inference for estimating exponential 
random graph models (ERGMs) which are some of the most important models in many 
research areas such as social networks analysis, physics and biology. The implementation 
of estimation methods for these models is a key point which enables us to use parameter 
estimates as a basis for simulation and to reproduce features of real networks. Unfortunately, 
the intractability of the normalising constant and degeneracy are two strong barriers to 
parameter estimation for these models. The classical inferential methods such as the Monte 
Carlo maximum likelihood (MC-MLE) of Geyer and Thompson| (|1992|) and pseudolikelihood 



estimation (MPLE) of Besad (1974) and 



Strauss and Ikeda| ([1990]) , are very widely used in 



partice, but are lacking in certain respects. In some instance it is difficult to obtain reasonable 
results or to understand the properties of the approximations, using these methods. 

The Bayesian estimation approach for these models has not yet been deeply and fully 
explored. Despite this, Bayesian inference is very appropriate in this context since it allows 
uncertainty about model parameters given the data to be explored through a posterior dis- 
tribution. Moreover the Bayesian approach allows a formal comparison procedure among 
different competing models using posterior probabilities. Therefore the methods presented 



in this work aim to contribute to the development of a Bayesian-based methodology area for 
these models. 

Networks are relational data represented as graphs, consisting of nodes and edges. Many 
probability models have been proposed in order to summarise the general structure of graphs 
by utilising their local properties. Each of these models take different assumptions into 



account: the Bernoulli random graph model (Erdos and Renyi 1959) in which edges are 



considered independent of each other; the p\ model (Holland and Leinhardt, 1981) where 



dyads are assumed independent, and its random effects variant the P2 model (van Duijn 



et al. 2004); and the Markov random graph model (Frank and Strauss, 1986) where each 



pair of edges is conditionally dependent given the rest of the graph. The family of exponential 



random graph models (Wasserman and Pattison (1996), see also Robins et al. (2007b) for a 



recent review) is a generalisation of the latter model and has been thought to be a flexible way 
to model the complex dependence structure of network graphs. Recent developments have led 



to the introduction of new specifications by Snijders et al. (2006) and the implementation of 



curved exponential random graph models by Hunter and Handcock (2006). New modelling 



alternatives such as the latent variable models have been proposed by Hoff et al. (2002) 



under the assumption that each node of the graph has a unknown position in a latent space 
and the probability of the edges are functions of those positions and node covariates. The 



latent position cluster model of Handcock et al. (2007b) represents a further extension of 



this approach that takes account of clustering. Stochastic blockmodel methods (Nowicki and 



Snijders! 2001) involve block model structures whereby nodes of the graph are partitioned 



into latent classes and their relationship depends on block membership. More recently, the 



mixed membership approach ( Airoldi et al. 2008 ) has emerged as a flexible modelling tool 



for networks extending the assumption of a single block membership. 

In this paper we are concerned with Bayesian inference for exponential random graph 
models. This leads to the problem of the double intractability of the posterior density as 
both the model and the posterior normalisation terms cannot be evaluated. This prob- 



lem can be overcome by adapting the exchange algorithm of Murray et al. (2006) to the 



network graph framework. Typically the high posterior density region is thin and highly 
correlated. For this reason, we also propose to use a population-based MCMC approach 
so as to improve the mixing and local moves on the high posterior density region reduc- 
ing the chain's autocorrelation significantly. An R package called Bergm, which accom- 
panies this paper, contains all the procedures used in this paper. It can be found at: 
http : //cran . r-pro j ect . org/web/packages/Bergm/. 

This article is structured as follows. A basic description of exponential random graph 
models together with a brief illustration of the issue of degeneracy is given in Section 2. 
Section 3 reviews two of the most important and used methods for likelihood inference. 
In Section 4 we introduce Bayesian inference via the exchange algorithm and its further 
improvement through a population-based MCMC procedure. In Section 5 we fit different 
models to three benchmark network datasets and outline some conclusions in Section 6. 



2 Exponential random graph models 

Typically networks consist of a set of actors and relationships between pairs of them, for ex- 
ample social interactions between individuals. The network topology structure is measured 



by a random adjacency matrix Y of a graph on n nodes (actors) and a set of edges (rela- 
tionships) {Yij : i = 1, . . . , re; j = 1, . . . , re} where Yjj = 1 if the pair (i,j) is connected, and 
Yij = otherwise. Edges connecting a node to itself are not allowed so Ya = 0. The graph 
Y may be directed (digraph) or undirected depending on the nature of the relationships 
between the actors. 

Let y denote the set of all possible graphs on re nodes and let y a realisation of Y . Expo- 
nential random graph models (ERGMs) are a particular class of discrete linear exponential 
families which represent the probability distribution of Y as 

z(6) z{9) 

where s(y) is a known vector of sufficient statistics (e.g. the number of edges, degree statistics, 
triangles, etc.) and 6 G G are model parameters. Since y consists of 2^2 ) possible undirected 
graphs, the normalising constant z{6) = Ylyey ex P{^* s (?/)} ^ s extremely difficult to evaluate 
for all but trivially small graphs. For this reason, ERGMs are very difficult to handle in 
practice. In spite of this difficulty, ERGMs are very popular in the literature since they are 
conceived to capture the complex dependence structure of the graph and allow a reasonable 
interpretation of the observed data. The dependence hypothesis at the basis of these models 
is that edges self organize into small structures called configurations. There is a wide range 



of possible network configurations (Robins et al. 2007a) which give flexibility to adapt to 



different contexts. A positive parameter value for 9i € 6 result in a tendency for the certain 
configuration corresponding to s- L {y) to be observed in the data than would otherwise be 
expected by chance. 

2.1 Degeneracy 

Degeneracy is one of the most important aspects of random graph models: it refers to a 
probability model defined by a certain value of that places most of its mass on a small 
number of graph topologies, for example, empty graphs or complete graphs. 

Consider the mean value parametrisation for the model (fip defined by //(#) = E^[s(y)]. 
Let C be the convex hull of the set {s(y) : y E y}, ri(C) its relative interior and rbd(C) 
its relative boundary. It turns out that if the expected values n(6) of the sufficient statistics 
(J>(0) approach the boundary rbd(C) of the hull, the model places most of the probability 
mass on a small set of graphs belonging to deg(y) = {y 6 y : s(y) G rbd(C)}. It is also 
known that the MLE exists if and only if s(y) £ ri(C) and if it exists it is unique. 

When the model is near-degenerate, MCMC inference methods may fail to find the max- 
imum likelihood estimate (MLE) and returning an estimate of that is unlikely to generate 
networks closely resembling the observed graph. This is because the convergence of the al- 
gorithm may be greatly affected by degenerate parameters values which during the network 
simulation process may yield graphs which are full or empty. 



A more detailed description of this issue can be found in Handcock 



et al. (2009). The new specifications proposed by Snijders et al. (2006) can often mitigate 



(2003) andRinaldo 



degeneracy and provide reasonable fit to the data. 



3 Classical inference 

3.1 Maximum pseudolikelihood estimation 

A standard approach to approximate the distribution of a Markov random field is to use 



a maximum pseudolikelihood (MPLE) approximation, first proposed in Besag (1974) and 



adapted for social network models in Strauss and Ikeda (1990). This approximation consists 



of a product of easily normalised full-conditional distributions 
ir(y\0) « TT pS eudo{y\0) = Y[^(yij\y-ij,0) 

n{yij = l|y_, 



(2) 









n(y ij = 0\y_ ij ,9)] yir 



where y_^ denotes all the dyads of the graph excluding j/y. The basic idea underlying this 
method is the assumption of weak dependence between the variables in the graph so that 
the likelihood can be well approximated by the pseudolikelihood function. This leads to fast 
estimation. Nevertheless this approach turns out to be generally inadequate since it only 
uses local information whereas the structure of the graph is affected by global interaction. 



In the context of the autologistic distribution in spatial statistics, Friel et al. (2009) showed 
that the pseudolikelihood estimator can lead to inefficient estimation. 



3.2 Monte Carlo maximum likelihood estimation 

There are many instances of statistical models with intractable normalising constants. This 



general problem was tackled in Geyer and Thompson ( 1992 ) who introduced the Monte Carlo 



maximum likelihood (MC-MLE) algorithm. In the context of ERGMs, a key identity is the 
following 



z(0 o ) 



E. 



2/|0o 



gg(y) 
Q0 o (y) 



v^ gg(v) io (y) 



(3) 



where qg(-) is the unnormalised likelihood of parameters 0, 0q is fixed set of parameter 
values, and E^.q denotes an expectation taken with respect to the distribution 7r(y\0o). In 
practice this ratio of normalising constants is approximated using graphs y 1 , . . . , y m sampled 
via MCMC from Tr(y\6o) and importance sampling. This yields the following approximated 
log likelihood ratio: 



w. 



00 ( 



(9) - £(0 O ) « (0 - 0o)*s(y) - log 



_. m 
-5>p{(0-0 O )^)} 



i=l 



(4) 



where £(■) is the log-likelihood, wq is a function of 0, and its maximum value serves as a 
Monte Carlo estimate of the MLE. 

A crucial aspect of this algorithm is the choice of 0q. Ideally 0q should be very close to 
the maximum likelihood estimator of 0. Viewed as a function of 0, wq (9) in Ml) is very 
sensitive to the choice of 0q. A poorly chosen value of 9q may lead to an objective function 
Q that cannot even be maximised. We illustrate this point in Section 3.4. 



In pratice, 0$ is often chosen as the maximiser of pi), although this itself maybe a very 
biased estimator. Indeed, Q may also be sensitive to numerical instability, since it effectively 
computes the ratio of a normalising constant, but it is well understood that the normalising 
constants can vary by orders of magnitude with 0. 

In fact, if the value of Oq lies in the "degenerate region" then graphs y t , . . . , y m simulated 
from 7r(y|#o) will tend to belong to deg(y) and hence the estimate of the ratio of normalising 
constants z{6)/ z{Oq) will be very poor. Consequently wq (0) in (|4|) will yield unreasonable 
estimates of the MLE. In the worst situation, wq (6) does not nave an optimum. This 



behaviour is well understood as presented in Handcock (2003) and Rinaldo et al. (2009). 



3.3 A pedagogical example 

Let us consider, for the purpose of illustration, a simple 16-node graph: the Florentine family 



business graph (Padgett and Ansell, 1993) concerning the business relations between some 



Florentine families in around 1430. The network is displayed in Figure [T] and shows that few 
edges between the families are present but with quite high level of triangularisation. 




Figure 1: Florentine family business graph. 
For this undirected network we propose to estimate the following 2-dimensional model: 



n(y\0) 



1 



z(0) 



exp{0isi(y) + 2 S2{y)} 



(5) 



with statistics si(y) = Y.i<jVij an d S2{y) = Y,i<j<kVikyjk which are respectively the ob- 
served number of edges and two-stars, that is, the number of nodes with degree at least 
two. We use the statnet package (Handcock et al. 2007a) in the R programming language 



to compute estimates of the maximum pseudolikelihood estimator and the Markov chain 
Monte Carlo estimator. Despite its simplicity the parameters of this model turn out to be 
difficult to estimate in practice, and both the MPLE and the MC-MLE fail to produce rea- 
sonable estimates. Table [T] reports the parameter estimates of both MPLE and MC-MLE. 
The standard errors of the MC-MLE are unreasonably large due to poor convergence of the 
algorithm. In fact, in this case it turns out that graphs y 1 , . . . ,y m simulated from Tr(y\0Q), 
which are used in Q, are typically complete graphs. This is illustrated in Figure^] The 
left-hand plots show the trace of the MCMC sampled sufficient statistics simulated during 
the MCMC iterations, where 9q is set equal to the MPLE. The right-hand plots show the 
respective density estimate. These parameter estimates generate mainly full graphs. This is 



good evidence of degeneracy, thereby illustrating, in this example, that MPLE and MC-MLE 
do not give parameter estimates which are in agreement with the observed data. 





MC-MLE 


MPLE 


Parameter 


Estimate Std. Error 


Estimate Std. Error 


0i (edges) 
02 (2-stars) 


-3.39 21.69 
0.30 0.79 


-3.39 0.70 
0.35 0.14 



Table 1: Summary of parameter estimates of the model (pi). 
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Figure 2: MCMC output of the graphs simulated from (pi) where 0q is the maximum pseudo- 
likelihood estimate. The top and bottom rows corresponds to si(y) and S2(y), respectively. 



As before, the issue of choice of initial value 6q is crucial to the performance and con- 
vergence of MC-MLE. In Figure put can be observed how wq (0) in M| varies with respect 
to two possible choices of #n. In this case note that when 6$ corresponds to the MPLE, Q 
gives an approximation that cannot even be maximized whereas 6q = (0,0), for example, 
would seem to be a better choice. 



, — , O 

CD im 



O 

o 
I 



o 
o 

CD 

I 




o 
o 

CD 



O 
O 
CM 



O 
O 

eg 

I 



o 
o 

CD 

I 




01 



Op 



Figure 3: Approximation of the log ratio in Ml) by using wq (0) for two different initial 
values: Oq = MPLE (solid curve) and 6q = (0,(1) (dotted curve). 

4 Bayesian inference 



Consider the Bayesian treatment of this problem (see Koskinen (2004)), where a prior dis- 
tribution 7f(0) is placed on 0, and interest is in the posterior distribution 

7r(%) OC 7r(3/|0)7r(0). 

This type of posterior distribution is sometimes called "doubly-intractable" due to the (stan- 
dard) intractability of sampling directly from the posterior distribution, but also due to the 
intractability of the likelihood model within the posterior. 

A naive implementation of a Metropolis-Hastings algorithm proposing to move from to 
6' would require calculation of the following ratio at each sweep of the algorithm 



q e >(yM0') z{e) 

qgivMG) X z{6') 
which is unworkable due to the presence of the normalising constants z{6) and z{9 ). 



(6) 



4.1 The exchange algorithm 

There has been considerable recent activity on the problem of sampling from such com- 



plicated distributions, for example, M0ller et al. (2006). The algorithm presented in this 



paper overcomes the problem of sampling from a distribution with intractable normalising 
constant, to a large extent. However the algorithm can result in an MCMC chain with poor 
mixing among the parameters. In social network analysis, further developments of this ap- 



proach have been proposed in Koskinen (2008) and Koskinen et al. (2009) by introducing the 



linked importance sampler auxiliary variable (LISA) algorithm which employs an importance 
sampler in each iteration to estimate the acceptance probability. 



In this paper we adapt to the ERGMs context the simple and easy-to-implement exchange 



algorithm presented in Murray et al. (2006). The algorithm samples from an augmented 
distribution 

7T(0', y', 9\y) oc ir{y\9)n{9)h(9'\9)ir{y'\9') (7) 

where n(y'\0 ) is the same distribution as the original distribution on which the data y is 
defined. The distribution h(9'\9) is any arbitrary distribution for the augmented variables 
9 which might depend on the variables 0, for example, a random walk distribution centered 
at 0. It is clear that the marginal distribution for in ¥l\ is the posterior distribution of 
interest. The algorithm can be written in the following concise way: 



1. GlBBS UPDATE OF (9' ,y')\ 

i Draw 0' ~ h{-\9). 
ii Draw y' ~ tt(-\0'). 



2. Propose the exchange move from 9 to 9' with probability 



a 



mm 



' q 9 (y')7r(9')h(9\9')q e ,(y) ^ z{9)z{9') 
' q e {y)it{9)h{9'\9)q el {y>) * z(9)z(9') 



(8) 



Notice in step 2, that all intractable normalising constants cancel above and below the 
fraction. In practice, the exchange move proposes to offer the observed data y the auxiliary 
parameter 9' and similarly to offer the auxiliary data y' the parameter 9. The affinity between 
9 and y' is measured by qgiy') / qg{y) and the affinity between 9' and y by qQ'(y)/qa'(y')- 
The difficult step of the algorithm is l.ii since this requires a draw from ir(y'\9 / ). Note 
that perfect sampling (Propp and Wilson ( 1996| )) is often possible for Markov random field 
models, however a pragmatic alternative is to sample from n(y'\9 ) by standard MCMC 
methods, for example, Gibbs sampling, and take a realisation from a long run of the chain 
as an approximate draw from the distribution. If we suppose that the proposal density h(-) 
is symmetric, then we can write the acceptance ratio a in (TsT) as 



a 



mm 



\ q e >(yM9')q e (y') 

' qe{y)n(9)q e >(y') 



Comparing this to (pi), we see that qg(y')/qo'(y') can be thought of as an importance sam- 
pling estimate of the ratio z{9)/ z{9') since 



E 



y\0' 



ggigO = y^ Qe(y') gfl'te') _ g(g) 



Recall that this is the identity pi) used in the Geyer-Thompson method. Note that this algo- 
rithm has some similarities with another likelihood-free method called Approximate Bayesian 



Computation (ABC) (see |Beaumont et al.| ( |2002[ ) and |Marjoram et al.| ( |2003[ )). The ABC 
algorithm also relies on drawing parameter values 0' and simulating new graphs from those 
values. The proposed move is accepted if there is good agreement between auxiliary data 



and observed data in terms of summary statistics. Here a good approximation to the true 
posterior density is guaranteed by the fact that the summary statistics are sufficient statistics 
of the probability model. 

4.2 MCMC simulation from ERGMs 

Recall that step l.ii of the exchange algorithm requires a draw y' from Tr(y'\6). The simu- 
lation of a network given a parameter value is accomplished by an MCMC algorithm which 
at each iteration compares the probability of a proposed graph to the observed one and then 
decides whether or not accept the proposed network. The latter is selected at each step by 
proposing a change in the current state of a single dyad (i.e. creating a new edge or dropping 
an old edge). This is obviously a computationally intensive procedure. 



In order to improve mixing of the Markov chain the ergm package for R (Hunter et al. , 2008) 
uses by default the "tie no tie" (TNT) sampler. This is the approach we take throughout 
this paper. At each iteration of the chain the TNT sampler, instead of selecting a dyad at 
random, first selects with equal probability the set of edges or the set of empty dyads, and 
than swaps a dyad at random within that chosen set. In this way, since most of the realistic 
networks are quite sparse, the probability of selecting an empty dyad to swap is lower and the 
sampler does not waste too much time proposing new edges which are likely to be rejected 



(Morris et al. 2008). Moreover, initialising the auxiliary chain with the observed network y 
leads to improved efficiency in terms of acceptance rates. This is the default choice in our 
implementation of the algorithm, which we now describe. 

4.3 Implementing the algorithm 

The exchange algorithm can be easily implemented by using existing software and the ergm 
package is particularly appropriate for this purpose. An R package called Bergnrl which 
accompanies this paper, draws heavily on the ergm package and provides functions for the 
implementation of the exchange algorithm which have been used to obtain the results re- 
ported in this paper. 

Let us return to the running example from Section 3.4. The posterior distribution of 
model ([5]) is written as 

7r ^' y - ) a ~z(d) exp i 1 ^ Vij + ° 2 ^ yikVjk i 7r ^- ) ' ^ 

^ ' \ i<3 i<j<k J 

Here we assume a flat multivariate normal prior n(0) ~ A/"(0, 30-Z^) where 1^ is the identity 
matrix of size equal to the number of model dimensions d. The simplest approach to update 
the parameter state is to update its components one at a time, using a single-site update 
sampler. We use the following proposals for the first and the second parameter value, re- 
spectively, h{-\9\) ~ A/"(0, 1) and /i(-|#2) ~ A/"(0, 0.1) where the proposal tuning parameters 
are chosen in order to reach a reasonable level of mixing, in this case an overall acceptance 
rate of 18%. The auxiliary chain consists of 1,000 iterations and the main one of 30,000 
iterations. 



http : //cran . r-pro j ect . org/web/packages/Bergm/ 



The posterior estimates are reported in Table [2j From Figure [4] it can be seen that the 
estimates produced by the exchange algorithm are very different from the ones obtained by 
MC-MLE and MPLE. The autocorrelation plots show that the autocorrelations are negligible 
after lag 150. 
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Figure 4: MCMC output of the pedagogical model ^. 

As mentioned in Section 4.1, an important issue for implementation of this algorithm is 
that of drawing from n(y'\0 ) and a pragmatic way to do that is to use the final realisation of 
a graph simulated from a long run of an MCMC sampler. Here we are interested in testing 
the sensitivity of the posterior output to the number of auxiliary iterations used to generate 
the sampled graph. Table [2] displays the results using three different number of iterations for 
the auxiliary chain. As we can see from the three outputs, there is not a significant difference 
between the results obtained using 500 or more iterations. 



Iterations 


Parameter 


Post. Mean Post. Sd. 


500 


6\ (edges) 
6 2 (2-stars) 


-2.43 0.52 
0.11 0.12 


1,000 


6\ (edges) 
2 (2-stars) 


-2.42 0.51 
0.11 0.11 


5,000 


6\ (edges) 
9 2 (2-stars) 


-2.43 0.51 
0.10 0.12 



Table 2: Posterior outputs for different numbers of iterations for the auxiliary chain. 

It is worth mentioning here that the posterior density is highly correlated and that the 
high posterior density region is thin. The high posterior density region for the pedagogical 
example is shown in the left-hand plot of Figure [5] In such a situation, which is not rare for 
these kind of models, in absence of any information about the covariance structure of the 
posterior distribution, the single-site update may not be the best procedure as it may result 



10 



in a poorly mixing MCMC sampler. We will discuss this issue in Section 4.5. 

4.4 Convergence of the Markov chain 

Exploration of the parameters of the posterior distribution using MCMC is of crucial impor- 
tance, but more so in light of the problem of degeneracy. Here we have that, to a large extent, 
the exchange algorithm results in quite fast convergence to the stationary distribution. We 
present a heuristic argument as to why this is the case. Assume that s(y) £ rint(C), that 
tt(0) is very flat and that h(-) is symmetric. We can then write the acceptance ratio Q as: 

a « min (l,exp{(0 - 0')\s{y') - s{y))}) (10) 

where s(y) is the fixed vector of the sufficient statistics of the observed graph and s(y') is 
a vector of sufficient statistics of a graph drawn from 7r(y'\G ). It is known, see Section 2 of 



van Duijn et al. (2009), for example, that 

(0-0')*(M0)-M0'))>o. (11) 

Recall that /i(0) = E^[s(y)] is the mean parameterisation for 0. Notice that if there is good 
agreement between y and 6' , and good agreement between y' and 6, then 

s(y') « fx(0) and s(y) « /x(0')- 



Equation ( 11 ) would then implies that the exponent in {jlOh is positive, whereby the exchange 
move is accepted. Empirically, it has been observed that fast convergence occurs even when 
the initial parameters are set to lie in the degenerate region. The right-hand plot of Figure [5] 
displays the two dimensional trace of 12 MCMC runs starting from very different points in the 
parameter space for the pedagogical example. In each instance, the Markov chain converges 
quickly to the high posterior density region. In fact, we have observed this behaviour for 
each of the datasets we analysed. 

Recall that the exchange algorithm involves drawing networks from the likelihood model. 
It is natural to ask questions such as, what proportion of such simulated networks are full or 
empty? And more importantly are any of these networks accepted in the MCMC algorithm? 
Figure [6] partly answers these questions, in the case of the pedagogical example of Section 
3.4. Here it can be seen that the algorithm frequently simulates low or high density graphs, 
but most of the networks accepted are those whose sufficient statistics are in close agreement 
to the observed sufficient statistics. This is a very useful aspect of the algorithm. 

4.5 Population MCMC can improve mixing 

While easy to implement, the single-site update procedure can lead to slow mixing if there 
is strong temporal dependence in the state process. As shown above, often there may be a 
strong correlation between model parameters and also the high posterior density region can 
be thin. In order to improve mixing we propose to use an adaptive direction sampling (ADS) 
method (Gilks et al. (1994) and Roberts and Gilks (1994)), similar to that of ter Braak 



and Vrugt (2008). We consider a population MCMC approach consisting of a collection of 
H chains which interact with one another. Here the state space is {#i, . . . , 6h} with target 
distribution 7r(0i|y)<8>- • -(8>7r(0^|j/). A "parallel ADS" move may be described algorithmically 
as follows. 



11 





8i 61 

Figure 5: Posterior density (left) and traces (right) of the first 1,000 iterations of 12 chains 
from the posterior distribution corresponding to the pedagogical example. 



(a) 



(b) 







30000 




30000 



Figure 6: Number of edges (left) and two-stars (right) of 2, 000 graphs (thinned by a factor of 
15 from the original 30, 000 iterations) simulated (gray dots) and graphs whose parameters 
were accepted (black dots). The solid line represents the number of edges in the observed 
graph. 



12 



For each chain h = 1, . . . , H 

1. Select at random h\ and h,2 without replacement from {1, . . . , H} \ h 

2. Sample e from a symmetric distribution 

3. Propose 0' h = 6\ + 1 (G^ - 6\ 2 ) + e 

4. Sample y' from Tv(-\G' h ) by MCMC methods, for example, the "tie no tie" (TNT) 
sampler, taking a realisation from a long run of the chain as an approximate draw 
from this distribution. 

5. Accept the move from 6\ to 6 1 ^ = 6' h with probability 



mm 



q e i h (y')n(o' h )q e > h (yy 
i0i(yMoi)q e > h (y') 

This move is illustrated graphically in Figure [7j 




O 




Figure 7: (a) The parallel ADS move of 6 % h is generated from the difference of the states 
6\ and 61 selected from the other chains and a random term e. (b) The reverse jump is 
obtained by reversing e and the order of the difference between 61 and 6\ . 



For our running example we set 7=1 and e ~ A/"(0, 0.1/^) where the tuning parameters 
were set so that the acceptance rate is around 20%. Initially all the chains are run in parallel 
using a block update of each individual chain and after a certain number of iterations the 
parallel ADS update begins. For this example we set the number of chains H = 5. The 
posterior output from the population MCMC algorithm is shown in Table [3j 

Figure [8] shows that the autocorrelation curve now decreases faster than the single-site 
update algorithm and is negligible at around lag 50. Recall that for the single-site update 
version of this algorithm, the autocorrelation is negligible after lag 150. Therefore the single- 
site sampler would need to be run for around 3 times more iterations than the population 
MCMC sampler in order to achieve a comparable number of effectively independent draws 
from the posterior distribution. For more complex networks, the reduction in autocorrelation 
using the population MCMC algorithm was considerable. 
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Parameter 


Post. Mean Post. Sd. 


Chain 1 


9\ (edges) 
9 2 (2-stars) 


-2.39 0.55 
0.11 0.12 


Chain 2 


0i (edges) 
2 (2-stars) 


-2.49 0.51 
0.13 0.12 


Chain 3 


0\ (edges) 
2 (2-stars) 


-2.43 0.57 
0.11 0.12 


Chain 4 


0\ (edges) 
2 (2-stars) 


-2.44 0.54 
0.13 0.13 


Chain 5 


0i (edges) 
2 (2-stars) 


-2.42 0.52 
0.12 0.13 


Overall 


0\ (edges) 
02 (2-stars) 


-2.44 0.54 
0.12 0.12 



Table 3: Summary of posterior parameter density of the model ^. 
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Figure 8: MCMC output of a single chain of the model (J9|. 
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Figure 9: 2-D traces of 400 iterations for the single-site update algorithm (left) and a single 
chain of the parallel ADS update algorithm (right). 

Figure [9] depicts the 2-dimensional trace of 400 iterations for both the single-site update 
and a single chain using the parallel ADS update. We see that the latter can easily move into 
the correlated high posterior density region thus ensuring a better mixing of the algorithm. 

The algorithm took approximately 13 minutes to estimate model ([9]) using 30, 000 iter- 
ations of the single-site update and about 6 minutes using the parallel ADS updater with 
6, 000 iterations per chain. 



5 Examples 

We now consider three different benchmark networks datasets (two undirected and one di- 



rected) and we illustrate models with various specifications (see Robins et al. (2007a) and 



|Snijders et al . (2006) for details). For each of them we apply the population MCMC with 
parallel ADS update, setting the number of chains equal to twice the number of model 
parameters. 



5.1 Molecule synthetic network 

This dataset is included with the ergm package for R. The elongated shape of this 20-node 



graph, shown in Figure 10, resembles the chemical structure of a molecule. 
We consider the following 4-dimensional model 



where 



tt(0|v) oc — — exp I J2 d iSi (y) \ vr(0) 



(12) 



number of edges 
number of two-stars 



8 l(v)='Ei<jVij 

s 2{y) = T. l<j<k Vikyjk 

s s(y) = Y,i<j<k<l VuVjiVki number of three-stars 
s ^(v) = Y,i<j<k yjkVikVij number of triangles 
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Figure 10: Molecule synthetic graph. 



We use the flat prior ir(0i) ~ A/"(0, 30), for % = 1, ... , 4, and we set 7 = 0.5 and e ~ Af(0, 0.17) 
corresponding to an overall acceptance probability of 22%. The auxiliary chain consists of 
1,000 iterations and the main one of 4,000 iterations for each chain. The algorithm took 



approximately 9 minutes to sample the posterior model (12) for a total of 32, 000 iterations. 



In Table [4] we see that the posterior density estimates of the 8 chains are similar to each 
other. Aggregating the output across all chains, the posterior estimates for the first three 
parameters indicate an overall tendency for nodes to have a limited number of multiple edges 
while the positive estimates relative to the triangle parameter capture the propensity towards 
transitivity. Table [5] reports the estimates of both MC-MLE and MPLE. Here MC-MLE fails 
to converge as the MPLE estimates generate mainly full graphs. 



In Figure 11 it can be seen that the autocorrelations of the parameters decay quickly 
around lag 200. By comparison, a single chain sampler (not reported here) with single-site 
updating of the vector led to negligible autocorrelation at around lag 1,000. Therefore, 
the single-site sampler would need to be run for around 5 times more iterations than the 
population MCMC sampler in order to achieve comparable number of effectively independent 
draws from the posterior distribution. 

A pragmatic way to examine the fit of the data to the posterior model the output obtained 
is to implement a Bayesian goodness-of-fit procedure. In order to do this, 100 graphs are 
simulated from 100 independent realisations taken from the estimated posterior distribution 
and compared to the observed graph in terms of high-level characteristics which are not 
modelled explicitly, namely, degree distributions (for degrees greater than 3); the minimum 
geodesic distance (namely, the proportion of pairs of nodes whose shortest connected path is 
of length I, for I = 1,2,...); the number of edge-wise shared partners (the number of edges 
in the network that share I neighbours in common for / = 1,2,....) The results are shown 



in Figure 12 and indicate that the structure of the observed graph can be considered as a 
possible realisation of the posterior density. 



5.2 Dolphins network 



The undirected graph displayed in Figure 13 represents social associations between 62 dol- 



phins living off Doubtfull Sound in New Zealand ( Lusseau et al. , 2003 ) . The graph is inho 
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Figure 11: Molecule dataset: MCMC output of a single chain of the posterior distribution 
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Parameter 


Post. Mean 


Post. Sd. 




9\ (edges) 


2.88 


3.10 


Chain 1 


2 (2-stars) 


-1.06 


0.92 




03 (3-stars) 


-0.04 


0.43 




04 (triangles) 


1.55 


0.54 




9\ (edges) 


2.61 


3.12 


Chain 2 


02 (2-stars) 


-1.02 


0.99 




O3 (3-stars) 


-0.05 


0.42 




04 (triangles) 


1.70 


0.57 




9i (edges) 


2.71 


3.13 


Chain 3 


2 (2-stars) 

03 (3-stars) 


-1.01 
-0.06 


0.96 
0.48 




04 (triangles) 


1.58 


0.53 




0i (edges) 


2.68 


3.35 


Chain 4 


2 (2-stars) 

03 (3-stars) 


-1.01 
-0.03 


1.02 

0.47 




04 (triangles) 


1.45 


0.59 




0i (edges) 


2.72 


3.11 




2 (2-stars) 


-1.10 


1.14 




03 (3-stars) 


-0.02 


0.44 




04 (triangles) 


1.59 


0.63 




0i (edges) 


2.87 


3.46 


Chain 6 


2 (2-stars) 


-1.05 


1.00 


03 (3-stars) 


-0.10 


0.47 




04 (triangles) 


1.65 


0.55 




0i (edges) 


2.60 


3.25 


Chain 7 


2 (2-stars) 

03 (3-stars) 


-1.10 
-0.10 


1.06 

0.47 




04 (triangles) 


1.62 


0.51 




0i (edges) 


2.82 


3.64 


Chain 8 


2 (2-stars) 

03 (3-stars) 


-1.03 
-0.10 


1.03 
0.50 




04 (triangles) 


1.61 


0.59 




0\ (edges) 


2.72 


3.27 




02 (2-stars) 


-1.02 


1.02 




03 (3-stars) 


-0.05 


0.46 




04 (triangles) 


1.60 


0.57 



Table 4: Molecule dataset: summary of posterior parameter density of the molecule network 
PI 
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MC-MLE 


MPLE 


Parameter 


Estimate 


Std. Error 


Estimate 


Std. Error 


01 (edges) 


4.11 


NA 


5.08 


1.90 


02 (2-stars) 


-1.64 


NA 


-2.02 


0.60 


03 (3-stars) 


0.42 


NA 


0.52 


0.27 


04 (triangles) 


1.30 


NA 


1.60 


0.39 



Table 5: Molecule dataset: Summary of parameter estimates of the model (12) 





12 3 4 
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minimum geodesic distance 




edge-wise shared partners 



Figure 12: Molecule dataset: Bayesian goodness-of-flt output. 



mogeneous, a few nodes have large number of edges and many have only one or two edges. 
We propose to estimate the following 3-dimensional model using three new specifications 



proposed by Snijders et al. (2006): 



7T(%) 



1 



X 



z{9) 



where 






exp {0isi (i/) + e 2 u(y, <f> u ) + 3 v(y, 4> v )} vr(0) (13) 



number of edges 
bu ) > D{(y) geometrically weighted degree 

'") > EPi(y) geometrically weighted edgewise 
shared partner. 



- e 

e" 



We set <j) u = 0.8 and 4> v = 0.8 so that the model is a regular, that is, a non-curved exponential 
random graph model (Hunter and Handcock ,12006). We use the flat multivariate normal prior 
n(0) ~ A/"(0, 307) and we set 7 = 0.5 and e ~ A/"(0, 0.17), where the tuning parameters were 
chosen so that the overall acceptance rate was around 10%. The auxiliary chain consists 
of 15, 000 iterations and the main chain 10, 000 iterations for each of the 6 chains of the 
population. The algorithm took approximately 1 hour and 10 minutes to estimate model 



(13). A single-site algorithm would take many more hours to carry out estimation using the 



same overall number of iterations. From Table [6] it can seen that mixing across the chains 
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Figure 13: Dolphins network dataset. 



appears to be reasonable and the overall posterior estimates displayed in the lower block of 
Table [6] provide us with a useful interpretation of the observed graph. The tendency to a low 
density of edges expressed by the negative posterior mean of the first parameter, is balanced 
by a propensity to multiple local clustering and tied nodes to share multiple neighbours in 
common expressed, respectively, by the last two positive posterior parameter estimates. 



Posterior estimates and autocorrelation function plots are reported in Figure 14 Each of 



the 3 autocorrelation functions decrease quite quickly and are negligible at around lag 200. 
A single chain MCMC version of the algorithm (not reported here) led to autocorrelation 
functions which decayed after lag 1500 - roughly a 7 fold increases with respect to the 
population MCMC version of the algorithm. 

As an aside, note that both the MC-MLE and MPLE estimates were quite similar to the 
posterior mean estimates for this dataset. 

Similar to the previous example, we carry out a Bayesian goodness of fit test. The results 



of this are displayed in Figure 15 and suggest that the model is a reasonable fit to the data. 



5.3 Sampson's Monk network 

This network represents the liking relationships among a group of 18 novices who were 



preparing to join a monastic order (Sampson, 



between actors, and is presented in Figure 16 



1968). This network consists of directed edges 



We propose to estimate the following 3-dimensional model 

1 



ir(6\y) oc — — exp{0isi(j/) + 6 2 s 2 (y) + 3 s 3 (y)} 7r{6) 



z(0 



(14) 



where 



s i(y) 

«2(V) 

ss(v) 



2si<j Vijl/jkUki 



number of edges 
number of mutual edges 
number of cyclic triads. 



We use the flat multivariate normal prior tt(6) ~ A/"(0, 3OJ3) and we set 7 = 0.8 and e ~ 
M(0, O.I/3), giving an overall acceptance rate of 18%. The auxiliary chain consists of 2,000 
iterations and the main one of 5, 000 iterations for each of the 6 chains of the population. 
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Parameter 


Post. Mean 


Post. Sd. 




01 (edges) 


-4.24 


0.32 


Chain 1 


2 (gwd) 


1.28 


0.50 




03 (gwesp) 


0.94 


0.13 




01 (edges) 


-4.24 


0.34 


Chain 2 


2 (gwd) 


1.27 


0.50 




3 (gwesp) 


0.94 


0.13 




01 (edges) 


-4.27 


0.33 


Chain 3 


02 (gwd) 


1.32 


0.50 




3 (gwesp) 


0.95 


0.13 




01 (edges) 


-4.27 


0.35 


Chain 4 


02 (gwd) 


1.27 


0.52 




3 (gwesp) 


0.95 


0.13 




01 (edges) 


-4.30 


0.37 


Chain 5 


02 (gwd) 


1.32 


0.55 




3 (gwesp) 


0.95 


0.13 




01 (edges) 


-4.24 


0.34 


Chain 6 


02 (gwd) 


1.32 


0.51 




3 (gwesp) 


0.95 


0.13 




01 (edges) 


-4.27 


0.35 


Overall 


02 (gwd) 


1.30 


0.52 




03 (gwesp) 


0.95 


0.13 



Table 6: Dolphins dataset: summary of posterior parameter density of the model (13) 
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Figure 14: Dolphins dataset: MCMC output of a single chain of the model (13) 
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Figure 15: Dolphins dataset: Bayesian goodness-of-fit output. 
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Figure 16: Sampson's Monk network. 

Notice in Table[7]that there is very little difference in output between the chains, giving good 
evidence that there is reasonable mixing between parameters across the chains. Table[7]shows 
that the posterior estimate for the cyclic triads parameter is not significative. The tendency 
to a low number of edges as expressed by the edge parameter is balanced by a high level of 
reciprocity expressed by mutual parameter. 

Figure |i~7| displays posterior density estimates from a single chain of the MCMC algorithm, 
together with autocorrelation functions for each parameter. All 3 autocorrelation functions 
decrease quite quickly and are negligible at around lag 60. This behaviour was very similar 
to each of the other 5 chains. By comparison, a single chain MCMC run (not reported here) 
with an equivalent number of iterations led to significantly higher autocorrelations, which 
were negligible at lag 400. Roughly 6 or 7 times more iterations for a single chain run would 
therefore be needed to yield effectively the same number of independent draws from the 
posterior as the population MCMC version of the algorithm. 



Here the algorithm took approximately 8 minutes to estimate model (14). As for the pre- 
vious example, we note that the MC-MLE and MPLE estimates are similar to the posterior 
mean estimates. 

As before, we propose a series of Bayesian goodness-of-fit tests to understand how well the 



estimated model fits a set of observations. The results in Figure 18 suggest that the model 
is a reasonable fit to the data. Since the network has directed edges, we used in-degree and 
out-degree statistics, instead of degree distributions. 

6 Discussion 

This paper has presented a Bayesian inferential framework for social network analysis for 
exponential random graph models. The approach used here, based on the exchange algorithm 



Murray et al. (2006), is shown to give very good performance. Here we present some datasets 
for which classical inference approaches, Maximum pseudolikelihood estimation and Monte 
Carlo - Maximum likelihood estimation, both of which are standard practice in social network 
analysis, fail to give reasonable parameter estimates. By contrast the Bayesian approach yield 
parameter estimates consistent with the data, in the sense that networks simulated from the 
posterior had similar topologies to the observed data. 

The issue of model degeneracy is important for ERG Ms, and one which can cause poten- 
tial difficulties for inference methods, especially for the MC-MLE method - a poorly chosen 
set of initial parameters, from the degeneracy region will lead to poor estimation. Empir- 
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Parameter 


Post. Mean Post. Sd. 


Chain 1 


9\ (edges) 
02 (mutual) 
#3 (ctriple) 


-1.72 0.31 
2.34 0.43 
-0.05 0.16 


Chain 2 


6i (edges) 
02 (mutual) 
#3 (ctriple) 


-1.73 0.30 
2.38 0.42 
-0.02 0.16 


Chain 3 


0\ (edges) 
02 (mutual) 
#3 (ctriple) 


-1.74 0.29 
2.37 0.43 
-0.04 0.15 


Chain 4 


0\ (edges) 
02 (mutual) 
#3 (ctriple) 


-1.72 0.29 
2.33 0.44 
-0.06 0.16 


Chain 5 


0\ (edges) 
02 (mutual) 
#3 (ctriple) 


-1.73 0.30 
2.30 0.43 
-0.06 0.16 


Chain 6 


0\ (edges) 
02 (mutual) 
#3 (ctriple) 


-1.72 0.30 
2.27 0.44 
-0.06 0.16 


Overall 


0\ (edges) 
02 (mutual) 
#3 (ctriple) 


-1.72 0.30 
2.33 0.43 
-0.04 0.16 



Table 7: Monks dataset: Summary of posterior parameter density of the model (14) 
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Figure 17: Monks dataset: MCMC output of a single chain from (14) 
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Figure 18: Monks dataset: Bayesian goodness-of-fit output. 
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ical evidence presented in this paper, suggests that the method we propose here results in 
a Markov chain which, even if initialised with parameters from the degenerate region, will 
quickly converge to the posterior high density regions. 

Our analysis has shown that the high posterior density region of ERGMs has typically 



very thin and correlated support. This point is well addressed in Rinaldo et al. (2009), for 
example. This presents a very difficult situation for exploration of the posterior support 
using MCMC methods. The population MCMC method presented in this paper is a useful 
first step towards addressing this problem, and further exploration in this direction should 
prove useful. All of the methods used in this paper are presented in the R package Bergm 
which accompanies this paper, and should prove useful to practitioners. 

Estimation for larger networks is feasible but at the cost of increased computational time. 
In experiments not reported here, we carried out inference for a graph with 104 nodes in under 
2 hours. Generally speaking, it seems reasonable to expect that the number of iterations 
needed for the auxiliary chain should be proportional to the number of dyads of the graphs, 
n 2 for a graph with n nodes. This is the computational bottleneck of the algorithm. The 
population MCMC approach which we outlined is very well suited to parallel computing, 
and this may in turn lead to reduction in the computational time needed to service ERGMs 
using the algorithm described in this paper. 
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